Breast Cancer Wisconsin (Diagnostic) Dataset


In this article, we compare a number of classification methods on the breast cancer dataset. Details of the dataset can be found in the Diagnostic Wisconsin Breast Cancer Database [1]. We apply the classification methods below and then compare their performance.

Dataset

The dataset has 569 instances and 32 attributes. The objective is to build a classification model that predicts Diagnosis from the remaining attributes. First, however, let's draw a count plot of the Diagnosis attribute.
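A minimal sketch of this step, assuming the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) rather than the UCI file used here. That copy omits the ID column and stores the diagnosis separately as the target, so it exposes 30 feature columns. The class counts are printed instead of plotted; they are the bar heights a count plot of Diagnosis would show.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the UCI file.
data = load_breast_cancer()
X, y = data.data, data.target  # X: feature matrix, y: 0 = malignant, 1 = benign

print("instances:", X.shape[0], "features:", X.shape[1])

# Class counts, i.e. the bar heights of a count plot of Diagnosis.
counts = Counter(str(data.target_names[label]) for label in y)
for name, n in counts.items():
    print(name, n)
```

The two classes are not equally represented, which is worth keeping in mind when choosing evaluation metrics.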

X and y sets

Features with high variance

Features with very high variance can dominate the others and hurt the modeling process. For this reason, we standardize the features by removing the mean and scaling to unit variance.
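Standardization can be sketched with scikit-learn's `StandardScaler`, which computes z = (x - mean) / std for each column. The example assumes scikit-learn's bundled copy of the dataset and is an illustration, not necessarily the exact preprocessing used in this article.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data  # assumption: scikit-learn's bundled copy

# Before scaling, feature variances differ by orders of magnitude.
variances = X.var(axis=0)
print("variance range before:", variances.min(), variances.max())

# z = (x - mean) / std, computed independently for each column.
X_std = StandardScaler().fit_transform(X)

print("largest |mean| after:", np.abs(X_std.mean(axis=0)).max())
print("variance range after:", X_std.var(axis=0).min(), X_std.var(axis=0).max())
```

When the model is evaluated on held-out data, the scaler is normally fit on the training portion only and then applied to the test portion, so that no test statistics leak into training.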

Training and testing sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
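The following sketch illustrates the stratification property on scikit-learn's bundled copy of the dataset. The number of splits, the shuffling, and the seed are illustrative choices, not settings taken from this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# Illustrative settings: 5 folds, shuffled with a fixed seed.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

overall = y.mean()  # fraction of class 1 in the complete set
fold_fractions = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    frac = y[test_idx].mean()  # fraction of class 1 in this test fold
    fold_fractions.append(frac)
    print(f"fold {fold}: test size={len(test_idx)}, class-1 fraction={frac:.3f}")

print(f"complete set: class-1 fraction={overall:.3f}")
```

Each test fold should report a class fraction close to that of the complete set, which is what distinguishes stratified folds from plain k-fold splits.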

Modeling: Support Vector Machine

Support-vector machines are supervised learning models that can be used for classification and regression analysis. See [2] and [3] for more details.

Some of the metrics that we use here to measure classification performance: \begin{align} \text{Confusion Matrix} = \begin{bmatrix}T_p & F_p\\ F_n & T_n\end{bmatrix}, \end{align}

where $T_p$, $T_n$, $F_p$, and $F_n$ represent true positive, true negative, false positive, and false negative, respectively.

\begin{align} \text{Precision} &= \frac{T_{p}}{T_{p} + F_{p}},\\ \text{Recall} &= \frac{T_{p}}{T_{p} + F_{n}},\\ \text{F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},\\ \text{Balanced-Accuracy (bACC)} &= \frac{1}{2}\left( \frac{T_{p}}{T_{p} + F_{n}} + \frac{T_{n}}{T_{n} + F_{p}}\right). \end{align}

Accuracy can be a misleading metric for imbalanced data sets. In these cases, balanced accuracy (bACC) [4] is recommended: it normalizes true positive and true negative predictions by the number of positive and negative samples, respectively, and divides their sum by two.
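To make the difference concrete, here is a small worked example of the formulas above in plain Python. The function name and the counts are hypothetical, chosen only to show how accuracy and balanced accuracy diverge on an imbalanced set; the function assumes that no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": 0.5 * (recall + specificity),
    }


# Hypothetical counts: 10 positive and 90 negative samples.
metrics = classification_metrics(tp=2, fp=1, fn=8, tn=89)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.910 even though only 2 of the 10 positive samples are found, while the balanced accuracy is about 0.594, which reflects the poor recall.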

Support Vector Machine with Default Parameters
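A minimal sketch of a default-parameter model, assuming scikit-learn's bundled copy of the dataset and a single stratified train/test split instead of the full cross-validation loop. The split size and seed are illustrative. The scaler is placed in a pipeline so that it is fit on the training portion only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Illustrative split: 25% held out, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVC() with no arguments uses scikit-learn's default hyperparameters.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes and columns are predicted classes, in label order.
cm = confusion_matrix(y_test, y_pred)
bacc = balanced_accuracy_score(y_test, y_pred)
print(cm)
print("balanced accuracy:", round(bacc, 3))
```

Note that scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, ordered by label, so for labels 0 and 1 the layout is [[T_n, F_p], [F_n, T_p]] rather than the layout in the matrix shown earlier.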

Support Vector Machine with the Best Parameters

In order to find the best parameters for our model, we can use RandomizedSearchCV. Here, we have defined a function Best_Parm to find the best parameters.
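The following is a sketch of a randomized search, not the Best_Parm function itself, whose body is not shown here. The search space, the number of iterations, the scoring choice, and the seeds are all assumptions made for illustration, and the data is scikit-learn's bundled copy of the dataset.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes below.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical search space, sampled on a log scale for C and gamma.
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-4, 1e0),
    "svc__kernel": ["rbf", "linear"],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=20,                      # number of sampled parameter settings
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV balanced accuracy:", round(search.best_score_, 3))

# With the default refit=True, this is a model retrained with the best parameters.
best_model = search.best_estimator_
```

Because `refit=True` is the default, `search.best_estimator_` is already a pipeline retrained on all the supplied data with the selected parameters, so it can be used directly for prediction.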

Since we have identified the best parameters for our modeling, we train another model using these parameters.


References


  1. UC Irvine Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set
  2. scikit-learn Support Vector Machines
  3. Support-Vector Machine Wikipedia page
  4. Mower, Jeffrey P. "PREP-Mt: predictive RNA editor for plant mitochondrial genes." BMC bioinformatics 6.1 (2005): 1-15.